Conceptual Clustering of Korean Concordances Using Similarity between Morphemes

Authors

  • Dae-Ho Baek
  • Ho Lee
  • Hae-Chang Rim
Abstract

This paper describes a method for the conceptual clustering of Korean concordances. We compute conceptual similarity between concordances from the number of co-occurring morphemes and the similarities between morphemes. To compute similarity between morphemes, we use mutual information, the similarity between mutual information values, and vector similarity. When clustering concordances extracted from a part-of-speech-tagged corpus of about 170,000 words, the system correctly clusters 90.16% of the concordances. The results show that information about similarity between morphemes can be used to disambiguate word senses.
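The abstract names mutual information and vector similarity but does not give the formulas, so the following is only a minimal sketch under stated assumptions: pointwise mutual information estimated from co-occurrence counts within a context, and cosine similarity between the resulting mutual information vectors. The function names (`pmi_table`, `pmi_vector`, `cosine`) and the toy English contexts are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(contexts):
    """Pointwise mutual information for morpheme pairs sharing a context.

    Assumption: probabilities are estimated as the fraction of contexts
    in which a morpheme (or a pair of morphemes) appears.
    """
    n = len(contexts)
    single = Counter()
    pair = Counter()
    for ctx in contexts:
        morphemes = sorted(set(ctx))  # count each morpheme once per context
        single.update(morphemes)
        pair.update(combinations(morphemes, 2))
    table = {}
    for (a, b), c_ab in pair.items():
        value = math.log2((c_ab / n) / ((single[a] / n) * (single[b] / n)))
        table[(a, b)] = value
        table[(b, a)] = value  # PMI is symmetric
    return table


def pmi_vector(table, morpheme):
    """Sparse vector of PMI values between `morpheme` and its co-occurring morphemes."""
    return {b: v for (a, b), v in table.items() if a == morpheme}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy contexts; a real run would use morphemes from tagged concordances.
contexts = [
    ["bank", "money", "loan"],
    ["bank", "river", "water"],
    ["bank", "money", "deposit"],
    ["river", "water", "fish"],
]
table = pmi_table(contexts)
similarity = cosine(pmi_vector(table, "money"), pmi_vector(table, "river"))
```

In a sketch like this, two morphemes come out as similar when they have comparable mutual information values with the same neighbours. How the paper combines such scores with the count of co-occurring morphemes is not stated in the abstract.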


Similar resources

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...


Frequency Effects

This study accounts for Korean /n/-epenthesis from a usage-based perspective, by describing the reduced productivity of epenthesis as an analogical change in progress. We found that epenthesis probability rises as whole-word frequency increases, supporting the hypothesis that analogical change begins in low-frequency words (Bybee 2002). We interpret the findings as support for the idea that freq...


Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of Korean

Most errors in Korean morphological analysis and part-of-speech (POS) tagging are caused by unknown morphemes. This paper presents a syllable-pattern-based generalized unknown-morpheme-estimation method with POSTAG (POStech TAGger), which is a statistical and rule-based hybrid POS tagging system. This method of guessing unknown morphemes is based on a combination of a morpheme pattern dictionary...


A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One of these errors is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There must be one of them in a data source, but for some reasons like aggregation of ...


Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering

Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used to train the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...



Publication date: 2007